At the start of a data analysis project, researchers are often advised to look first at 
multiple simple models: that is, to begin with simple, one-variable-at-a-time analyses, 
such as multiple single-variable tests for association or significance, and then, later, 
somehow (how?) pull all the separate pieces together into a single comprehensive 
framework, an inclusive data narrative. As a way of detecting true compound effects that 
show more than just marginal associations, this strategy is easily defeated by simple 
examples, as has been recently highlighted in BioData Mining [1]. But more critically, it 
is looking through the data telescope from the wrong end. 
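
One such simple example can be sketched in a few lines. The toy data here are an 
assumption made for illustration and are not taken from [1]: two binary predictors 
(hypothetical names x1 and x2) and an outcome y that is their exclusive-or. Each 
predictor on its own shows no association with y, yet the pair determines y completely. 

```python
from itertools import product


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


# Balanced toy design: every combination of two binary predictors, 25 copies each.
rows = [(a, b) for a, b in product((0, 1), repeat=2) for _ in range(25)]
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]  # outcome depends only on the pair (exclusive-or)

# One-variable-at-a-time screening: each marginal correlation with y.
r1 = pearson(x1, y)
r2 = pearson(x2, y)

# Joint view: fit a lookup table with one predicted outcome per (x1, x2) cell.
cells = {}
for pair, t in zip(rows, y):
    cells.setdefault(pair, []).append(t)
joint_rule = {pair: round(sum(ts) / len(ts)) for pair, ts in cells.items()}
joint_accuracy = sum(joint_rule[pair] == t for pair, t in zip(rows, y)) / len(rows)

print(f"corr(x1, y) = {r1:.3f}, corr(x2, y) = {r2:.3f}")
print(f"accuracy of the joint rule = {joint_accuracy:.2f}")
```

In this balanced design both marginal correlations are zero while the joint lookup 
table classifies every row correctly, so a screen built only from single-variable tests 
would discard both predictors. 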

In our experience, it is more efficient to begin by sifting rigorously and carefully 
through complex data, treating it as truly complex. The researcher can then work to 
locate the simple models, the layers of the story. Several interconnected issues bear 
on this problem and this approach. 

First, as researchers ourselves we find it more efficient to use a fast filter to detect 
the presence of any signal, and then use these indications to fit smaller models. Many 
methods are available for doing this sensibly and quickly. Some of these are fully 
nonparametric statistical learning machines [2] and others are more parametric 
schemes [3]. In any of these approaches, discovery is the initial goal. 
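
The filter-then-refine idea can be sketched as follows. This is a minimal illustration 
under stated assumptions, not the methods of [2] or [3]: a leave-one-out 
nearest-neighbour rule stands in for the fast nonparametric detector, the data are a 
hypothetical toy set in which columns 0 and 1 jointly drive the outcome and columns 2 
and 3 are noise, and the function name loo_1nn_accuracy is our own. 

```python
import random


def loo_1nn_accuracy(X, y, keep):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule using columns in `keep`."""
    hits = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = sum((xi[k] - xj[k]) ** 2 for k in keep)
            if d < best_d:
                best_j, best_d = j, d
        hits += y[best_j] == y[i]
    return hits / len(X)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: columns 0 and 1 drive the outcome jointly (exclusive-or);
# columns 2 and 3 are pure noise.
X, y = [], []
for _ in range(200):
    a, b = rng.randint(0, 1), rng.randint(0, 1)
    X.append([a + rng.gauss(0, 0.1), b + rng.gauss(0, 0.1),
              rng.gauss(0, 0.1), rng.gauss(0, 0.1)])
    y.append(a ^ b)

all_cols = [0, 1, 2, 3]

# Step 1, discovery: is there any signal at all?
full = loo_1nn_accuracy(X, y, all_cols)

# Baseline with the labels shuffled: what "no signal" looks like for this detector.
y_perm = y[:]
rng.shuffle(y_perm)
null = loo_1nn_accuracy(X, y_perm, all_cols)

# Step 2, refinement: drop one column at a time and record the accuracy lost.
drop_loss = {c: full - loo_1nn_accuracy(X, y, [k for k in all_cols if k != c])
             for c in all_cols}

print(f"detector accuracy: {full:.2f} (shuffled-label baseline: {null:.2f})")
for c, loss in drop_loss.items():
    print(f"column {c}: accuracy lost when dropped = {loss:.2f}")
```

Columns whose removal costs accuracy are the candidates for a smaller follow-up model; 
columns that can be dropped at no cost are not. Any fast learner could replace the 
nearest-neighbour rule in this sketch. 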

Second, this approach is sometimes faulted on the grounds that results from a fast 
signal detector are impossible to interpret and hence are of little help. It is certainly 
true that learning machine output can be hard to understand if we stop with just the 
declaration of signal present, or signal not present. Yet many learning machine 
methods are available for sorting the results and then getting down to simple models. 

Third, it is also true that multiple small or large models can be equally compelling 
and all more or less correct: the models of Nature are strongly underdetermined. 
Confirmation after discovery, after multiple disclosures, should be reinforcing across 
domains and models. A single tiny best model is a rare event: not impossible, but not 
usefully treated as a likely and reachable endpoint. 

Fourth, the terms simple and complex are themselves neither simple nor well-formed. 
Galileo believed that the arc of a water fountain was a catenary, but this is not correct: 
it is a parabola. On the other hand, the best supporting building arch is a catenary. 
And functionally a parabola is simpler than a catenary. Further, it was once thought 
that if planets did orbit around the Sun then they must do so in circles, since anything 
else, such as ellipses, would be too complex. This is not correct either. Note in these 
examples the detectors and data at the time were not sufficient to separate these 
models, confirming one, discounting another, and locating the best context for each 
model. But keeping all models available as working hypotheses might have promoted 
critical distinctions: consider using a catenary for building arches, but don't patiently 
wait for a water fountain to assume the shape of a catenary. Galileo's genius can be 
validated in other ways. 

Summarizing, having a tool is often not the problem. It is knowing when to use it, 
and then which end to use. 

© 2014 Malley and Moore; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication 
waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise 
stated. 

Malley and Moore BioData Mining 2014, 7:13 
http://www.biodatamining.org/content/7/1/13 